Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

generates another file format, with the “.qzv” file extension, called a visualization file. This visualization file is a standalone, shareable file that may contain any kind of output, such as images, tables, and interactive representations. Plugin methods take QIIME2 artifacts as input and produce artifacts as output, whereas a plugin visualizer produces a single visualization file for viewing or sharing. In the following, we will show you how to use QIIME2 to analyze targeted gene metagenomic data. The general workflow of amplicon-based metagenomic data analysis is shown in Figure 7.1.
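QIIME2 artifact (“.qza”) and visualization (“.qzv”) files are ordinary zip archives, so their contents can be inspected without QIIME2 itself. The following is a minimal sketch using only the Python standard library; the function names and the example file name are hypothetical and chosen only for illustration.

```python
import zipfile
from pathlib import Path


def result_kind(path):
    """Classify a QIIME2 result file by its extension."""
    kinds = {".qza": "artifact", ".qzv": "visualization"}
    return kinds.get(Path(path).suffix, "unknown")


def list_archive_members(path):
    """Return the sorted member names stored in a .qza or .qzv file.

    Assumes the file is a standard zip archive.
    """
    path = Path(path)
    if result_kind(path) == "unknown":
        raise ValueError(f"expected a .qza or .qzv file, got {path.name!r}")
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())
```

For example, calling `list_archive_members("demux.qzv")` on a hypothetical visualization file would return the paths stored inside it. In practice these paths sit under a single top-level directory named after the result's UUID.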

7.3.1 QIIME2 Input Files

Raw amplicon-based metagenomic data may be acquired from a study conducted in the laboratory or downloaded from a database. A study will have a design tailored to address its research objectives, and the study design usually dictates the workflow of the analysis. Whether the raw data is from your own project or downloaded from a database, it usually includes (i) raw sequence data and (ii) metadata (information about the samples and the study design). The analysis with QIIME2 requires importing these two inputs and converting them into artifacts; from then on, only the artifacts are used for the analysis. Therefore, the first task is to import the raw sequence data and metadata into artifacts. Each QIIME2 artifact has a semantic type that enables QIIME2 to identify the artifacts suitable for a given analysis. In the following, we will discuss the raw sequence data and metadata in more detail.

7.3.1.1 Importing Sequence Data

QIIME2 accepts inputs in a variety of file formats, including FASTA files, FASTQ files, and feature files of OTUs or representative sequences. FASTQ files (single-end or paired-end) are the most commonly used. The reads in FASTQ files may be multiplexed or demultiplexed. Multiplexed reads either follow the Earth Microbiome Project (EMP) protocol (the barcode sequences are in a separate file) or are non-EMP (the barcodes are embedded in the reads) [16]. Demultiplexed reads, on the other hand, can either be in Casava 1.8 format or not.
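The cases described above can be summarized as a lookup from data layout to the settings passed to the “qiime tools import” command. The sketch below is illustrative only: the layout labels and function name are hypothetical, the semantic type and format names are recalled from the QIIME2 importing documentation and should be checked against your installed release, and the manifest entries assume Phred33 quality scores for demultiplexed reads that are not in Casava 1.8 format.

```python
# (layout, read layout) -> (semantic type, input format or None).
# Names should be verified against the installed QIIME2 release.
IMPORT_SETTINGS = {
    ("multiplexed-emp", "single"): ("EMPSingleEndSequences", None),
    ("multiplexed-emp", "paired"): ("EMPPairedEndSequences", None),
    ("multiplexed-barcode-in-sequence", "single"): (
        "MultiplexedSingleEndBarcodeInSequence", None),
    ("multiplexed-barcode-in-sequence", "paired"): (
        "MultiplexedPairedEndBarcodeInSequence", None),
    ("demultiplexed-casava", "single"): (
        "SampleData[SequencesWithQuality]",
        "CasavaOneEightSingleLanePerSampleDirFmt"),
    ("demultiplexed-casava", "paired"): (
        "SampleData[PairedEndSequencesWithQuality]",
        "CasavaOneEightSingleLanePerSampleDirFmt"),
    # Assumption: non-Casava demultiplexed reads are described by a
    # manifest file and use Phred33 quality scores.
    ("demultiplexed-manifest", "single"): (
        "SampleData[SequencesWithQuality]",
        "SingleEndFastqManifestPhred33V2"),
    ("demultiplexed-manifest", "paired"): (
        "SampleData[PairedEndSequencesWithQuality]",
        "PairedEndFastqManifestPhred33V2"),
}


def build_import_command(layout, read_layout, input_path, output_path):
    """Build the argument list for a 'qiime tools import' call."""
    semantic_type, input_format = IMPORT_SETTINGS[(layout, read_layout)]
    command = [
        "qiime", "tools", "import",
        "--type", semantic_type,
        "--input-path", str(input_path),
        "--output-path", str(output_path),
    ]
    # Only pass --input-format when the layout needs an explicit format.
    if input_format is not None:
        command += ["--input-format", input_format]
    return command
```

The returned list can be printed for inspection or passed to `subprocess.run` on a machine where QIIME2 is installed. Building the command as a list, rather than as a single string, avoids shell quoting problems with the brackets in the semantic type names.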

Some laboratories have their own sequencer, while others depend on genomic core facilities for sequencing. In either case, the raw data will be provided in one of the above formats. Raw sequence data can also be downloaded from metagenomics databases. Examples include the NCBI SRA database, available at “https://www.ncbi.nlm.nih.gov/sra”, and the EMBL-EBI-hosted MGnify database, available at “https://www.ebi.ac.uk/metagenomics/”. Both databases provide data generated from a variety of microbiome studies on specific environments such as human body sites, soil, seawater, and others. MGnify microbiome data can also be accessed from the NCBI SRA. Sequence data such as FASTQ files generated from those studies are stored in the NCBI SRA as sequence read archives, which are compressed files that can be downloaded using the SRA Toolkit.
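As a rough illustration of the download step, the sketch below assembles the two SRA Toolkit commands commonly used to fetch a run and convert it to FASTQ: “prefetch” followed by “fasterq-dump”. The function names, the output directory, and the accession “SRR0000000” are placeholders, not values from any real study.

```python
import shlex


def sra_download_commands(accession, outdir="fastq"):
    """Return the SRA Toolkit commands to fetch one run as FASTQ files.

    'prefetch' downloads the compressed archive for the accession, and
    'fasterq-dump --split-files' converts it to FASTQ, writing paired
    reads to separate files in 'outdir'.
    """
    prefetch = ["prefetch", accession]
    dump = ["fasterq-dump", "--split-files", "--outdir", outdir, accession]
    return [prefetch, dump]


def as_shell(commands):
    """Render argument lists as shell-safe command strings."""
    return [shlex.join(command) for command in commands]


if __name__ == "__main__":
    # Placeholder accession; replace it with a real run accession.
    for line in as_shell(sra_download_commands("SRR0000000")):
        print(line)
```

Note that “fasterq-dump” writes uncompressed FASTQ files, so they may need to be compressed (for example with gzip) and renamed before being imported into QIIME2, depending on the import format used.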

Whether the input for QIIME2 is a FASTA file, FASTQ files, or a feature file, it must be imported by QIIME2 and converted into a QIIME2 artifact. How an input file is imported depends on the raw data file format; each file format is imported in its own way. In general, however, to import any input data, you must use “qiime tools import”. We already